AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing customers. Another insight from the market research is that customers perceive the support services of the bank poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery have both decided to reach out to the Data Science team for help.
To identify different segments in the existing customer base, based on their spending patterns as well as their past interactions with the bank.
import warnings
warnings.filterwarnings('ignore')
import itertools as it
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from scipy.stats import zscore
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
%matplotlib inline
# Increase cell width
display(HTML("<style>.container { width:98% !important; }</style>"))
# Update seaborn default style
sns.set_style("ticks")
sns.set_context("notebook")
my_random_state = 5
credit_card_orig_df = pd.read_excel("Credit Card Customer Data.xlsx")
print("")
print("Credit Card Customer Data")
print("*************************")
print("")
display(HTML(credit_card_orig_df.head(10).to_html()))
print("")
print("Credit Card Customer Data -- Info")
print(credit_card_orig_df.info())
Based on the problem statement, the descriptions of the columns are:
- Sl_No : Serial number of the entry
- Customer Key : A key that uniquely identifies the customer
- Avg_Credit_Limit : The average credit limit for the customer across their credit cards
- Total_Credit_Cards : Total number of credit cards the customer holds with the bank
- Total_visits_bank : Total number of in-person bank visits made by the customer for queries
- Total_visits_online : Total number of online visits made by the customer for queries
- Total_calls_made : Total number of calls made by the customer for queries
print("")
print("Credit Card Customer Data -- Shape")
print(credit_card_orig_df.shape)
credit_card_df = credit_card_orig_df.copy()
# Since we are segmenting customers, Sl_No and Customer Key carry no behavioural information, so we drop them up front
credit_card_df.drop(["Sl_No", "Customer Key"], axis=1, inplace=True)
print("")
print("Credit Card Customer Data -- Describe")
display(HTML(credit_card_df.describe().T.to_html()))
cols_colors = ["r","g","c","m","grey"]
cols_length = len(credit_card_df.columns)
for column, index, color in it.zip_longest(credit_card_df.columns, range(cols_length), cols_colors):
print("")
print("********************************************************* Analysis of : {0} *********************************************************".format(column))
print("")
minimum = credit_card_df[column].min()
maximum = credit_card_df[column].max()
min_max_range = maximum - minimum
# Find the null/na values, min, max and range of the values
print("Null Values? : {}".format(credit_card_df[column].isnull().any()))
print("Na Values? : {}".format(credit_card_df[column].isna().any()))
print("Min : {}".format(minimum))
print("Max : {}".format(maximum))
print("Range : {0:0.2f}".format(min_max_range))
print("--------------")
# Find the mean, median and std. dev
print("Mean : {0:0.2f}".format(credit_card_df[column].mean()))
print("Median : {0:0.2f}".format(credit_card_df[column].median()))
print("Std Dev : {0:0.2f}".format(credit_card_df[column].std()))
col_skew = credit_card_df[column].skew()
# Find the skew
col_skew_str = "positive skew" if col_skew > 0 else ("negative skew" if col_skew < 0 else "no skew")
print("Skew : {0:0.2f} ({1})".format(col_skew, col_skew_str))
print("--------------")
q1 = credit_card_df[column].quantile(q=0.25)
q3 = credit_card_df[column].quantile(q=0.75)
iqr = q3 - q1
# Find the outliers
print("Q1 (1st Quartile) : {}".format(q1))
print("Q3 (3rd Quartile) : {}".format(q3))
print("IQR : {0:0.2f}".format(iqr))
print("------- Outliers using IQR -------")
lower_whisker = q1 - 1.5 * (iqr)
upper_whisker = q3 + 1.5 * (iqr)
print("Lower whisker : {0:0.2f}".format(lower_whisker))
print("Upper whisker : {0:0.2f}".format(upper_whisker))
num_below_lower = credit_card_df[credit_card_df[column] < lower_whisker][column].count()
num_above_upper = credit_card_df[credit_card_df[column] > upper_whisker][column].count()
perc_below_lower = (num_below_lower / len(credit_card_df)) * 100
perc_above_upper = (num_above_upper / len(credit_card_df)) * 100
print("Number of lower outliers : {}".format(num_below_lower))
print("Number of upper outliers : {}".format(num_above_upper))
print("% of the lower outliers, in dataset : {0:0.2f}%".format(perc_below_lower))
print("% of the upper outliers, in dataset : {0:0.2f}%".format(perc_above_upper))
print("")
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(20,6), edgecolor='k')
#boxplot
sns.boxplot(x=column,data=credit_card_df,orient='v',ax=ax1, color=color)
ax1.set_ylabel(column, fontsize=15)
ax1.set_title('Box Plot of {}'.format(column), fontsize=15, color="navy")
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(credit_card_df[column],ax=ax2, color=color)
ax2.axvline(credit_card_df[column].mean(),linestyle="dashed",label="mean",color="k")
ax2.legend(loc="best")
ax2.set_xlabel(column, fontsize=15)
ax2.set_title('Distribution of {}'.format(column), fontsize=15, color="navy")
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(credit_card_df[column], color=color)
ax3.axvline(credit_card_df[column].mean(),linestyle="dashed",label="mean",color="k")
ax3.legend(loc="best")
ax3.set_xlabel(column, fontsize=15)
ax3.set_title('Histogram of {}'.format(column), fontsize=15, color="navy")
ax3.tick_params(labelsize=15)
plt.tight_layout(pad=4.5)
plt.show();
print("")
print("*************************************************************************************************************************************")
sns.pairplot(credit_card_df,diag_kind='kde')
plt.show();
# Graphical representation of the correlation
correlation = credit_card_df.corr()
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig_corr, axes_corr = plt.subplots(nrows=1, ncols=1, figsize=(12,10))
# Hide the upper half of the matrix - So that we see the values only once
mask = np.zeros_like(correlation)
mask[np.triu_indices_from(mask, 1)] = True
sns.heatmap(correlation, ax=axes_corr, linewidths=.5, cbar_kws={"shrink": .5}, cmap=cmap, vmin=-1, annot=True, fmt=".2f", mask=mask)
fig_corr.suptitle("Correlation Matrix", fontsize=14)
plt.yticks(fontsize=10, va="center")
fig_corr.tight_layout(pad=4.5)
plt.show();
# Plot a density plot of the number of credit cards vs credit limit
ax = sns.kdeplot(credit_card_df["Total_Credit_Cards"], credit_card_df["Avg_Credit_Limit"], cmap="Blues", shade=True)
ax.set_title("Total Credit Cards vs Avg Credit Limit", pad=15, color="navy")
plt.show();
# Plot a 3D space graph of all the possible visits and the avg credit limit
fig = plt.figure(figsize=(13,13))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(credit_card_df['Total_visits_bank'], credit_card_df['Total_visits_online'], credit_card_df['Total_calls_made'], c=credit_card_df["Avg_Credit_Limit"], s=100, cmap="viridis")
ax.view_init(15, 130)
ax.set_xlabel("Total Bank Visits")
ax.set_ylabel("Total Online Visits")
ax.set_zlabel("Total Calls Made")
lab = fig.colorbar(scatter, shrink=.5)
lab.set_label("Avg Credit Limit",fontsize = 15)
plt.title("Distribution of Bank Visits, Online Visits and Calls Made",color="navy")
plt.show();
Since Avg_Credit_Limit and Total_Credit_Cards are on different scales compared to the other attributes, we will standardize the data before clustering.
# Standardize data based on zscore
credit_card_scaled_df = credit_card_df.apply(zscore)
display(HTML(credit_card_scaled_df.head(10).to_html()))
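As a quick sanity check on the standardization, z-scored columns should end up with mean ≈ 0 and population standard deviation ≈ 1. A minimal sketch, using hypothetical stand-in values rather than the actual Excel data:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical stand-in values; in the notebook, credit_card_df would be used
toy_df = pd.DataFrame({
    "Avg_Credit_Limit": [5000, 12000, 50000, 100000, 8000],
    "Total_Credit_Cards": [2, 4, 6, 9, 3],
})

toy_scaled_df = toy_df.apply(zscore)

# scipy's zscore uses the population std (ddof=0), so check with ddof=0
print(toy_scaled_df.mean().round(6).tolist())       # each column's mean is ~0
print(toy_scaled_df.std(ddof=0).round(6).tolist())  # each column's std is ~1
```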
# Make local copies of the dataframe
credit_card_kmeans_df = credit_card_df.copy()
credit_card_scaled_kmeans_df = credit_card_scaled_df.copy()
# Find distortion for various cluster sizes to find out the optimal cluster size
# Let's look at cluster sizes 1-10
clusters = range(1, 11)
meanDistortions=[]
for k in clusters:
model = KMeans(n_clusters=k, n_init=10, random_state=my_random_state)
model.fit(credit_card_scaled_kmeans_df)
prediction=model.predict(credit_card_scaled_kmeans_df)
# Find the mean distortion for each cluster
meanDistortions.append(sum(np.min(cdist(credit_card_scaled_kmeans_df, model.cluster_centers_, 'euclidean'), axis=1)) / credit_card_scaled_kmeans_df.shape[0])
# Plot the distortions for each cluster size, to make an elbow plot
fig, ax1=plt.subplots(1, 1, figsize=(10,6), edgecolor='k')
ax1.plot(clusters, meanDistortions, 'bx-')
ax1.axvline(3,linestyle="dashed",label="cluster",color="k")
ax1.set_xlabel('k')
ax1.set_ylabel('Average distortion')
ax1.legend(loc="best")
ax1.set_title('Selecting k with the Elbow Method')
plt.show();
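The elbow at k=3 can be cross-checked with silhouette scores. A minimal sketch, using synthetic blobs as a stand-in for credit_card_scaled_kmeans_df (the assumption being three well-separated clusters, as the elbow plot suggests):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=5)

# Silhouette score for each candidate k; higher is better
sil_scores = {}
for k in range(2, 10):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=5).fit_predict(X)
    sil_scores[k] = silhouette_score(X, cluster_labels)

best_k = max(sil_scores, key=sil_scores.get)
print(best_k)
```

With well-separated clusters, the silhouette score peaks at the true number of clusters, agreeing with the elbow method.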
# With cluster size as 3, find the labels
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = my_random_state)
kmeans.fit(credit_card_scaled_kmeans_df)
# Check the number of data in each cluster
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
# Except for Group 2, the distribution seems reasonable
# Let us check the cluster centers for each group
centroids = kmeans.cluster_centers_
centroid_df = pd.DataFrame(centroids, columns = list(credit_card_scaled_kmeans_df) )
centroid_df.transpose()
# Add group to original data
kmeans_predictions = kmeans.predict(credit_card_scaled_kmeans_df)
credit_card_kmeans_df["Group"] = kmeans_predictions
credit_card_scaled_kmeans_df["Group"] = kmeans_predictions
credit_card_kmeans_df['Group'] = credit_card_kmeans_df['Group'].astype('category')
credit_card_scaled_kmeans_df['Group'] = credit_card_scaled_kmeans_df['Group'].astype('category')
credit_card_kmeans_df.dtypes
# Let's look at the value counts
credit_card_kmeans_df['Group'].value_counts()
# Check the clusters in 3d space, to get a better sense of the clustering
import plotly.express as px
fig = px.scatter_3d(credit_card_kmeans_df, x='Total_visits_bank', y='Total_visits_online', z='Total_calls_made',color='Group', labels={"Total_visits_bank":"In Person", "Total_visits_online":"Online", "Total_calls_made":"Calls"})
fig.show();
# Let's check the pairplot, taking into consideration the new labels
sns.pairplot(credit_card_kmeans_df,hue="Group")
plt.show();
# Let's look at the boxplots to see the distribution of each attribute within each group
credit_card_scaled_kmeans_df.boxplot(by = 'Group', layout=(1,5), figsize=(22, 5))
plt.suptitle("Distribution of attributes by Group")
plt.show();
Legend : ( ⬆ - Most ⇩ - Less ⬇ - Least )
Customer segments are:
Group 0
Group 1
Group 2
# Create a dataframe with the observation for future analysis
kmeans_result = {
"Group": ["Group 0", "Group 1", "Group 2"],
"Avg Credit Limit": ["Least", "Less", "Most"],
"Number of Credit Cards": ["Least", "Less", "Most"],
"Call": ["Most", "Less", "Least"],
"Visit Bank": ["Less", "Most", "Least"],
"Visit Online": ["Less", "Least", "Most"]
}
kmeans_result_df = pd.DataFrame(kmeans_result).set_index("Group")
credit_card_hierarchical_df = credit_card_df.copy()
credit_card_scaled_hierarchical_df = credit_card_scaled_df.copy()
# For each linkage method, find the cophenetic coefficient and plot the dendrogram
# Also, for each linkage method and cluster sizes 2-10, find the silhouette score and record it in a separate dataframe
link_methods = ['single', 'complete', 'average', 'ward']
clusters = range(2, 11)
hierarchical_results = []
for l in link_methods:
print("")
print("********************************************************* Method : {0} *********************************************************".format(l))
Z = linkage(credit_card_scaled_hierarchical_df, method=l, metric='euclidean')
cc, cophn_dist = cophenet(Z, pdist(credit_card_scaled_hierarchical_df))
print("")
print("Cophenetic Coeff : {}".format(cc))
plt.figure(figsize=(25, 10))
plt.title("{0} : Dendrogram".format(l))
plt.ylabel('Distance')
dendrogram(Z)
plt.show();
for cluster_size in clusters:
agglomerative = AgglomerativeClustering(linkage=l, affinity='euclidean',n_clusters=cluster_size).fit_predict(credit_card_scaled_hierarchical_df)
sil_score = silhouette_score(credit_card_scaled_hierarchical_df, agglomerative, metric='euclidean')
hierarchical_results.append((l, cc, cluster_size, sil_score))
print("")
print("*************************************************************************************************************************************")
hierarchical_results_df = pd.DataFrame(hierarchical_results, columns=["Linkage Method", "Cophenetic Coeff", "Number of Cluster", "Silhouette Score"])
display(HTML(hierarchical_results_df.sort_values(["Cophenetic Coeff", "Silhouette Score"], ascending=False).to_html()))
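To make the cophenetic coefficient reported above concrete: it is the correlation between the original pairwise distances and the merge heights ("cophenetic distances") implied by the dendrogram. A minimal sketch on a tiny hypothetical dataset, using the same calls as the loop above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Tiny hypothetical dataset: two tight pairs plus one distant point
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0]])

# Build the linkage, then correlate the cophenetic distances
# against the original pairwise distances
Z = linkage(pts, method='average', metric='euclidean')
cc, cophn_dists = cophenet(Z, pdist(pts))

# A coefficient close to 1 means the dendrogram preserves the
# original distance structure faithfully
print(round(cc, 3))
```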
# For a maximum distance of 3.2 and average linkage, let's see the clusters formed
max_distance = 3.2
agglomerative = AgglomerativeClustering(n_clusters=None, distance_threshold=max_distance, affinity='euclidean', linkage='average')
agglomerative.fit(credit_card_scaled_hierarchical_df)
# Check the number of data in each cluster
labels = agglomerative.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
# Add group to original data
credit_card_hierarchical_df["Group"] = agglomerative.labels_
credit_card_scaled_hierarchical_df["Group"] = agglomerative.labels_
credit_card_hierarchical_df['Group'] = credit_card_hierarchical_df['Group'].astype('category')
credit_card_scaled_hierarchical_df['Group'] = credit_card_scaled_hierarchical_df['Group'].astype('category')
credit_card_hierarchical_df.dtypes
# Let's look at the value counts
credit_card_hierarchical_df['Group'].value_counts()
# Check the clusters in 3d space, to get a better sense of the clustering
import plotly.express as px
fig = px.scatter_3d(credit_card_hierarchical_df, x='Total_visits_bank', y='Total_visits_online', z='Total_calls_made',color='Group', labels={"Total_visits_bank":"In Person", "Total_visits_online":"Online", "Total_calls_made":"Calls"})
fig.show();
# Let's look at the boxplots to see the distribution of each attribute within each group
credit_card_scaled_hierarchical_df.boxplot(by = 'Group', layout=(1,5), figsize=(22, 5))
plt.suptitle("Distribution of attributes by Group")
plt.show();
Legend : ( ⬆ - Most ⇩ - Less ⬇ - Least )
Customer segments are:
Group 0
Group 1
Group 2
# K-Means
kmeans_silhouette_avg = silhouette_score(credit_card_scaled_kmeans_df.drop(columns=["Group"]), kmeans_predictions, metric='euclidean')
# Hierarchical
hierarchical_silhouette_avg = silhouette_score(credit_card_scaled_hierarchical_df.drop(columns=["Group"]), agglomerative.labels_, metric='euclidean')
print("K-Means Clustering Silhouette Score : {0}".format(kmeans_silhouette_avg))
print("Hierarchical Clustering Silhouette Score : {0}".format(hierarchical_silhouette_avg))
display(HTML(credit_card_kmeans_df.groupby("Group").describe().to_html()))
display(HTML(credit_card_hierarchical_df.groupby("Group").describe().to_html()))
Since the cluster (group) labels differ between the two methods, we will use the K-Means grouping for the analysis below.
Based on K-Means, the customer groups are:
display(HTML(kmeans_result_df.to_html()))
There are 3 segments of customers in the dataset provided.
Based on the cluster analysis above, we can determine:
- At a broad level, the segments are primarily distinguished by how the customer contacts the bank: calling most often, visiting in person most often, or visiting online most often
- Customers who most often call the bank have the least average credit limit and the fewest credit cards
- Customers who most often visit the bank in person have a lower average credit limit and fewer credit cards
- Customers who most often visit the bank online have the highest average credit limit and the most credit cards
For Group 0
They have the fewest credit cards and the lowest average credit limit, yet they call the bank often. From a pure economics standpoint, this could be unprofitable for the bank, as it is spending resources on customers who do not have a high average credit limit or many credit cards. However, it could also mean these customers have genuine issues that need to be addressed. The bank should first focus on reaching out to these customers and resolving any open issues, thereby increasing their trust in the bank. It can then run phone campaigns encouraging them to take out more credit cards with the bank.
For Group 1
They have fewer credit cards and a lower average credit limit, but they visit the bank most often and have the fewest online visits. To improve the bank's resource utilisation, the bank can run targeted in-person campaigns during these visits to educate them on the usefulness and convenience of online banking. This would save on-premises resources, and customers would no longer have to queue at a branch; instead they could be served from the comfort of their homes, improving customer satisfaction.
For Group 2
They have the most credit cards and a high average credit limit, and they handle all their dealings with the bank online. They are well informed and are effectively gold-tier customers for the bank. This is a good opportunity to target them with online campaigns for other products such as home loans and term deposits, which could increase the bank's revenue.